Thanks to my Data Science Professor Hector Corida Bravo and his notes. These helped me immensily for this project.
A popular source of argument among friends, enemies, frenemies, and family members (which could be in any of the previous categories) is the state of gun ownership and gun violence in the United States today. I know I have (cordially of course) discussed this issue a fair bit, especially after the Vegas and Parkland shootings. A common train of disagreement that tends to come from these conversations is the issue of regulation. Some people associated with a particular political leaning argue that more regulation leads to fewer guns which leads to less gun violence. Other people with the opposite political inclination argue that the regulations would only be followed by people who wouldn’t be committing gun crimes anyway, and the regulations would just give wrongdoers more incentive to do wrong without the fear of an armed bystander. However, rather than presenting unbiased information to the public, news sources have decided to join one side of the argument or the other and sensationalize the issues to make more money. On the regulation-is-good side of the argument (henceforth referred to as pro-regulation), there is this article and this more scholarly article from Stanford. On the other, regulation-is-bad side of the argument (henceforth the anti-regulation stance) is this article. Because, as mentioned before, news is sensationalist and we can’t believe anything they say, we will do the data ourselves. We will be getting our information on gun crime from this Kaggle data set and cross referencing it with this data set on gun laws. We will play with weights based on issues that matter to us using this supporting table. Finally, we will adjust our gun crime rates for population using a dataset from the census bureau. After we store this data in R and make it pretty and clean, we will visualize it with multiple charts, perform some regression work, and even use ML to predict future gun violence rates. Of course, I don’t know where this data came from exactly, it could come from people who are as bad as the news sources. Also, we must understand that there are several factors at work in an issue like this one, and just because a correlation was found doesn’t mean there is causation.
In terms of hypothesis testing, our null hypothesis is that there is no relation between gun violence and gun regulation, and our alternative hypothesis is that there is a relation between gun violence and gun regulation. More general hypothesis test information can be found here.
As a final thought, I wrote the preceding without doing any data analysis, so I will be just as surprised by the results of these procedures when I finish running them as you will be when you read this for the first time!
We will be doing this project in the language R in the RStudio. You can download RStudio here. If you want to get some more background information on R and RStudio, visit this page, which has links to various resources for familiarizing yourself with R. I will try to be very thorough here too, however.
Once you have R installed and have familiarized yourself with it to the extent you desire, sign up for a Kaggle account (I used my burner Google account and it was quite easy) and download the Kaggle data set and the data set on gun laws as CSV files (the Kaggle will download as a zipped folder with a CSV in it, you have to choose CSV from a list of options for the other data). You can open these files in Microsoft Excell or another spreadsheet program to inspect the data or look at the raw information in a text editor to get a feel for what R will actually be seeing.
Once you have RStudio open, got to File->New Project and create a directory for this project. This is what I will call the project’s home directory, and all associated files for the project should be put in this folder.
We will load in the CSV for the Kaggle gun violence data first. If you haven’t already put the CSV in the project’s home directory, do so.
#This library has everything and is fantastic, I've used it for
#every R project I've done
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
violence_full <- read.csv("gun-violence-data_01-2013_03-2018.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
violence_full %>% select(1:5)%>%head()
## incident_id date state city_or_county
## 1 461105 2013-01-01 Pennsylvania Mckeesport
## 2 460726 2013-01-01 California Hawthorne
## 3 478855 2013-01-01 Ohio Lorain
## 4 478925 2013-01-05 Colorado Aurora
## 5 478959 2013-01-07 North Carolina Greensboro
## 6 478948 2013-01-07 Oklahoma Tulsa
## address
## 1 1506 Versailles Avenue and Coursin Street
## 2 13500 block of Cerise Avenue
## 3 1776 East 28th Street
## 4 16000 block of East Ithaca Place
## 5 307 Mourning Dove Terrace
## 6 6000 block of South Owasso
As a note, all code will be commented with a max of two lines, and if I want to expound on my notes I will do it after the code block.
For example, the last line is what is called a pipeline. The ‘%>%’ operator takes the result of the thing in front of it and passes it as the first argument of the thing behind it. These pipes usually start with a dataset, in this case violence_full. Select picks columns, and this range means columns 1-5 inclusive. Head selects the first few rows of a dataset. If a line does nothing but create a value or mention a value without storing it, that value is printed.
Also, if you have not already installed a library, go to the packages tab (by default in the bottom right pane in RStudio), click the “install” button, type in the name of the package you are looking for, and click “Install.”
Now we will add the data set on gun laws and the census dataset. The government wasn’t kind enough to provide us with a CSV, so we will have to do a slightly different procedure for the .xlsx file they do provide.
#Pulls the data form our CSV and stores it as R's
#data structure, a data frame
gunlaws_full <- read.csv("raw_data.csv")
#this pipeline selects the first five colums and the
#first six rows of the dataset and prints.
gunlaws_full %>% select(1:5)%>%head()
## state year age18longgunpossess age18longgunsale age21handgunpossess
## 1 Alabama 2017 0 0 0
## 2 Alaska 2017 0 1 0
## 3 Arizona 2017 1 0 0
## 4 Arkansas 2017 0 0 0
## 5 California 2017 0 1 0
## 6 Colorado 2017 0 0 0
#library for reading .xlsx
library(xlsx)
## Loading required package: rJava
## Loading required package: xlsxjars
#the second argument in read.xlsx is the sheet index
#it can also be the sheet name
pops_full <- read.xlsx("nst-est2017-01.xlsx", 1)
pops_full %>% select(1:5)%>%head()
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
You will note that the dataframe for pops_full has a lot of junk in it. If you open it and the CSV for gun laws in Excell and compare them, you will notice that the .xlsx contains a lot more data than the gun laws CSV, and that most of this data will not be needed. We’ll deal with that in a bit.
If you want more information about reading in weird datafiles, check out this link.
If you look at the website for the categories, you’ll notice that there is no option to download the data. We will have to scrape it. As outlined in that link, we will have to tell R what sort of element it is looking for to collect the data from. If you don’t want to use the chrome extension the link talks about, you will have to inspect the element and select the class that that element is part of. You will put that in the argument of html_nodes. This will return a list of matches; you will have to select the one you want. In my case I wanted the first match it found.
#the library for scraping
library(rvest)
## Loading required package: xml2
##
## Attaching package: 'rvest'
## The following object is masked from 'package:purrr':
##
## pluck
## The following object is masked from 'package:readr':
##
## guess_encoding
#read in webpage.
#usually a web address
webpage <- read_html("State Firearm Laws - Use the Database Glossary Search Tool.html")
cats_full <- webpage %>%
html_nodes(".table") %>%
.[[1]] %>%
html_table(fill = TRUE)
head(cats_full)
## Code
## 1 age18longgunpossess
## 2 age18longgunsale
## 3 age21handgunpossess
## 4 age21handgunsale
## 5 age21longgunpossess
## 6 age21longgunsale
## Definition
## 1 No possession of long guns until age 18
## 2 Purchase of long guns from licensed dealers and private sellers restricted to age 18 and older
## 3 No possession of handguns until age 21
## 4 Purchase of handguns from licensed dealers and private sellers restricted to age 21 and older
## 5 No possession of long guns until age 21
## 6 Purchase of long guns from licensed dealers and private sellers restricted to age 21 and older
## Category/Subcategory
## 1 Possession regulationsAge restrictions
## 2 Buyer regulationsAge restrictions
## 3 Possession regulationsAge restrictions
## 4 Buyer regulationsAge restrictions
## 5 Possession regulationsAge restrictions
## 6 Buyer regulationsAge restrictions
One snag I ran into here is that the page is dynamically loaded, so it only worked when I right clicked on the page and selected “Save page as…” and put the result in my project’s home direcory.
As mentioned before, there is a lot of junk in the pops_full dataset. We only want the data and done of the fluff. I’ll remind you what this dataset looks like:
head(pops_full)
## table.with.row.headers.in.column.A.and.column.headers.in.rows.3.through.4...leading.dots.indicate.sub.parts.
## 1 Table 1. Annual Estimates of the Resident Population for the United States, Regions, States, and Puerto Rico: April 1, 2010 to July 1, 2017
## 2 Geographic Area
## 3 <NA>
## 4 United States
## 5 Northeast
## 6 Midwest
## NA. NA..1 NA..2 NA..3
## 1 <NA> <NA> <NA> NA
## 2 40269 <NA> Population Estimate (as of July 1) NA
## 3 Census Estimates Base 2010 2011
## 4 308745538 308758105 309338421 311644280
## 5 55317240 55318350 55388349 55642659
## 6 66927001 66929794 66973360 67141501
## NA..4 NA..5 NA..6 NA..7 NA..8 NA..9
## 1 NA NA NA NA NA NA
## 2 NA NA NA NA NA NA
## 3 2012 2013 2014 2015 2016 2017
## 4 313993272 316234505 318622525 321039839 323405935 325719178
## 5 55860261 56047732 56203078 56296628 56359360 56470581
## 6 67318295 67534451 67720120 67839187 67978168 68179351
You’ll notice that we only want a few of the columns (called atributes): the ones with the state names and the dates. You will also notice that there are some extra rows (called entities). The following code trims them down.
#library for text manipulation
library(stringr)
#the arrow is a storage operator
pops_trimmed <- pops_full %>%
#removes attributes 2 and 3
select(-(2:3)) %>%
#removes first 8 entities
slice(-(1:8)) %>%
#removes last 6 entities
slice(-(52:58) ) %>%
#change the column names (temporarily for ease of use)
#the ticks let us use code we wouldn't usually put in a pipeline
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017")) %>%
#remove the first character in the string for State
mutate(State = str_sub(State, 2, 25)) %>%
#change 2010 (ticks because its a number) to numeric data
transform(`2010` = as.double(as.character(`2010`))) %>%
#change the column names (again...)
#try without this to see what happens.
`colnames<-`(c("State", "2010", "2011", "2012", "2013",
"2014", "2015", "2016", "2017"))
head(pops_trimmed)
## State 2010 2011 2012 2013 2014 2015
## 1 Alabama 4785579 4798649 4813946 4827660 4840037 4850858
## 2 Alaska 714015 722259 730825 736760 736759 737979
## 3 Arizona 6407002 6465488 6544211 6616124 6706435 6802262
## 4 Arkansas 2921737 2938640 2949208 2956780 2964800 2975626
## 5 California 37327690 37672654 38019006 38347383 38701278 39032444
## 6 Colorado 5048029 5116411 5186330 5262556 5342311 5440445
## 2016 2017
## 1 4860545 4874747
## 2 741522 739795
## 3 6908642 7016270
## 4 2988231 3004279
## 5 39296476 39536653
## 6 5530105 5607154
You will notice that I changed the state names to remove the periods, changed the attribute names, and changed the 2010 attribute to neumaric (helpful later, when we want them to be numbers: it was treating it as text). For more information on string manipulation, check this link.
However, we’re not done. We want to be able to join the violence dataset to this based on year and state eventually. This dataset as it stands is not very condusive to this, it would be better if there was a separate attribute for each state in each year. This also plays into the idea of tidy data (incidentally, this seems to be the paper Professor Bravo used for his class notes–might want to cite this Professor).
#gather(dataset, what attribute names turns into, what the data turns into,
# what not to include)
pops_tidy <- gather(pops_trimmed, year, population, -State)
head(pops_tidy)
## State year population
## 1 Alabama 2010 4785579
## 2 Alaska 2010 714015
## 3 Arizona 2010 6407002
## 4 Arkansas 2010 2921737
## 5 California 2010 37327690
## 6 Colorado 2010 5048029
Also, the violence_full dataset has more attributes than we need, and the date attribute is not seen by R as a date. To use it effectivly as a date, we would have to “date” attribute to a date attribute. This involves some fanciness, and we really don’t need the full date for this project, so we won’t bother. We will create a year attribute for use when joining with the population data for analysis instead.
violence_trimmed <- violence_full %>%
#select the 4 attributes we want
select(date, state, n_killed, n_injured, longitude, latitude) %>%
mutate(date = str_sub(date, 1, 4))
head(violence_trimmed)
## date state n_killed n_injured longitude latitude
## 1 2013 Pennsylvania 0 4 -79.8559 40.3467
## 2 2013 California 1 3 -118.3330 33.9090
## 3 2013 Ohio 1 3 -82.1377 41.4455
## 4 2013 Colorado 4 0 -104.8020 39.6518
## 5 2013 North Carolina 2 2 -79.9569 36.1140
## 6 2013 Oklahoma 4 0 -95.9768 36.2405
This dataset is really cool and I regret chopping out this much data. I am sure someone particularly motivated could do some really cool stuff with all the data this thing has. If you haven’t looked more closely at it yet, do so!
Because we are taking so long to get to our final data, and because we now have a hyper cool dataset with latitude and longitude, we are going to make cool maps! Hooray!
One thing about this is, as a tangent, I will not be going into extreme detail about this section.
Here, we will make an ultra-cool layered map, with a layer per year of data, showing the location of the reports of gun violence.
#the cool library for making maps
library(leaflet)
## Warning: package 'leaflet' was built under R version 3.4.4
violence_map <- na.omit(violence_trimmed)
markermap <- leaflet(violence_map) %>%
addTiles() %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2013)%>%select(latitude)))},
radius = 2,
color = 'red',
group="2013")%>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2014)%>%select(latitude)))},
radius = 2,
color = 'orange',
group="2014") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2015)%>%select(latitude)))},
radius = 2,
color = 'green',
group="2015") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2016)%>%select(latitude)))},
radius = 2,
color = 'blue',
group="2016") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2017)%>%select(latitude)))},
radius = 2,
color = 'purple',
group="2017") %>%
addCircleMarkers(~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(longitude)))},
~{as.numeric(unlist(violence_map%>%filter(date==2018)%>%select(latitude)))},
radius = 2,
color = 'black',
group="2018") %>%
addLayersControl(
overlayGroups = c("2013", "2014", "2015", "2016", "2017", "2018"),
options = layersControlOptions(collapsed = FALSE)
) %>%
hideGroup("2014") %>%
hideGroup("2015") %>%
hideGroup("2016") %>%
hideGroup("2017") %>%
hideGroup("2018")
markermap
You can find more about the leaflet here.